Generalized Category Discovery via Token Manifold Capacity Learning
Tang, Luyao, Huang, Kunze, Chen, Chaoqi, Chen, Cheng
Generalized category discovery (GCD) is essential for improving deep learning models' robustness in open-world scenarios by clustering unlabeled data containing both known and novel categories. Traditional GCD methods focus on minimizing intra-cluster variations, often sacrificing manifold capacity, which limits the richness of intra-class representations. In this paper, we propose a novel approach, Maximum Token Manifold Capacity (MTMC), that prioritizes maximizing the manifold capacity of class tokens to preserve the diversity and complexity of data. MTMC leverages the nuclear norm of singular values as a measure of manifold capacity, ensuring that the representation of samples remains informative and well-structured. This method enhances the discriminability of clusters, allowing the model to capture detailed semantic features and avoid the loss of critical information during clustering. Through theoretical analysis and extensive experiments on coarse- and fine-grained datasets, we demonstrate that MTMC outperforms existing GCD methods, improving both clustering accuracy and the estimation of category numbers. The integration of MTMC leads to more complete representations, better inter-class separability, and a reduction in dimensional collapse, establishing MTMC as a vital component for robust open-world learning. Code is available at github.com/lytang63/MTMC.
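The abstract's capacity measure, the nuclear norm (sum of singular values) of a token representation matrix, can be illustrated in a few lines. This is a minimal sketch of the general idea, not the paper's implementation; the function name and the centering step are our own choices:

```python
import numpy as np

def nuclear_norm_capacity(tokens: np.ndarray) -> float:
    """Sum of singular values of a (num_tokens x dim) representation matrix.

    A larger nuclear norm means variance is spread across more directions,
    i.e. a richer, less collapsed manifold.
    """
    # Center the representations so the norm reflects spread, not the mean.
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return float(singular_values.sum())

rng = np.random.default_rng(0)
rich = rng.normal(size=(128, 64))  # variance in many directions
collapsed = rng.normal(size=(128, 1)) @ rng.normal(size=(1, 64))  # rank-1
assert nuclear_norm_capacity(rich) > nuclear_norm_capacity(collapsed)
```

Maximizing such a term during training penalizes dimensional collapse, which is the effect the abstract attributes to MTMC.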
Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry
Chou, Chi-Ning, Le, Hang, Wang, Yichen, Chung, SueYeon
The ability to integrate task-relevant information into neural representations is a fundamental aspect of both biological and artificial intelligence. To enable theoretical analysis, recent work has examined whether a network learns task-relevant features (rich learning) or resembles a random feature model (or a kernel machine, i.e., lazy learning). However, this simple lazy-versus-rich dichotomy overlooks the possibility of various subtypes of feature learning that emerge from different architectures, learning rules, and data properties. Furthermore, most existing approaches emphasize weight matrices or neural tangent kernels, limiting their applicability to neuroscience because they do not explicitly characterize representations. In this work, we introduce an analysis framework based on representational geometry to study feature learning. Instead of analyzing what are the learned features, we focus on characterizing how task-relevant representational manifolds evolve during the learning process. In both theory and experiment, we find that when a network learns features useful for solving a task, the task-relevant manifolds become increasingly untangled. Moreover, by tracking changes in the underlying manifold geometry, we uncover distinct learning stages throughout training, as well as different learning strategies associated with training hyperparameters, uncovering subtypes of feature learning beyond the lazy-versus-rich dichotomy. Applying our method to neuroscience and machine learning, we gain geometric insights into the structural inductive biases of neural circuits solving cognitive tasks and the mechanisms underlying out-of-distribution generalization in image classification. Our framework provides a novel geometric perspective for understanding and quantifying feature learning in both artificial and biological neural networks.
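The claim that task-relevant manifolds "become increasingly untangled" during learning can be tracked with a simple geometric statistic. The score below is a coarse proxy of our own (between-class center distance over within-class spread), not the authors' manifold-capacity analysis:

```python
import numpy as np

def untangling_score(class_a: np.ndarray, class_b: np.ndarray) -> float:
    """Ratio of between-class center distance to average within-class spread.

    Higher values mean the two manifolds are easier to separate linearly;
    evaluating this at successive checkpoints gives a coarse untangling curve.
    """
    mu_a, mu_b = class_a.mean(axis=0), class_b.mean(axis=0)
    between = np.linalg.norm(mu_a - mu_b)
    within = 0.5 * (class_a.std(axis=0).mean() + class_b.std(axis=0).mean())
    return float(between / (within + 1e-12))

rng = np.random.default_rng(0)
# Synthetic stand-ins for early vs. late training representations.
early = untangling_score(rng.normal(0.0, 1.0, (200, 32)),
                         rng.normal(0.2, 1.0, (200, 32)))
late = untangling_score(rng.normal(0.0, 1.0, (200, 32)),
                        rng.normal(3.0, 1.0, (200, 32)))
assert late > early  # better-separated manifolds score higher
```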
The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models
Kirsanov, Artem, Chou, Chi-Ning, Cho, Kyunghyun, Chung, SueYeon
Decoder-only language models have the ability to dynamically switch between various computational tasks based on input prompts. Despite many successful applications of prompting, there is very limited understanding of the internal mechanism behind such flexibility. In this work, we investigate how different prompting methods affect the geometry of representations in these models. Employing a framework grounded in statistical physics, we reveal that various prompting techniques, while achieving similar performance, operate through distinct representational mechanisms for task adaptation. Our analysis highlights the critical role of input distribution samples and label semantics in few-shot in-context learning. We also demonstrate evidence of synergistic and interfering interactions between different tasks on the representational level. Our work contributes to the theoretical understanding of large language models and lays the groundwork for developing more effective, representation-aware prompting strategies.
Toward a Geometric Theory of Manifold Untangling
It has been hypothesized that the ventral stream processing for object recognition is based on a mechanism called cortically local subspace untangling. A mathematical abstraction of object recognition by the visual cortex is how to untangle the manifolds associated with different object categories. Such a manifold untangling problem is closely related to the celebrated kernel trick in metric space. In this paper, we conjecture that there is a more general solution to manifold untangling in the topological space without artificially defining any distance metric. Geometrically, we can either $embed$ a manifold in a higher dimensional space to promote selectivity or $flatten$ a manifold to promote tolerance. General strategies of both global manifold embedding and local manifold flattening are presented and connected with existing work on the untangling of image, audio, and language data. We also discuss the implications of manifold untangling for motor control and internal representations.
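The "embed to promote selectivity" strategy has a textbook instance: XOR-style data that no hyperplane can separate in the plane becomes linearly separable after lifting to a higher-dimensional feature space. This example is standard kernel-trick material, not taken from the paper:

```python
import numpy as np

def embed(x: np.ndarray) -> np.ndarray:
    """Lift 2-D points to 3-D by appending the product feature x1 * x2."""
    return np.column_stack([x, x[:, 0] * x[:, 1]])

# XOR-style data: the two classes are not linearly separable in the plane.
pts = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
labels = np.array([1, 1, -1, -1])

# In the embedded space a single hyperplane (normal along the new axis)
# untangles the classes: sign(x1 * x2) matches the labels exactly.
lifted = embed(pts)
assert np.all(np.sign(lifted[:, 2]) == labels)
```

The complementary strategy, flattening, instead reduces curvature of each manifold so that nearby variations of one object stay on one side of a decision boundary.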
Understanding the Logit Distributions of Adversarially-Trained Deep Neural Networks
Seguin, Landan, Ndirango, Anthony, Mishra, Neeli, Chung, SueYeon, Lee, Tyler
Adversarial defenses train deep neural networks to be invariant to the input perturbations from adversarial attacks. Almost all defense strategies achieve this invariance through adversarial training i.e. training on inputs with adversarial perturbations. Although adversarial training is successful at mitigating adversarial attacks, the behavioral differences between adversarially-trained (AT) models and standard models are still poorly understood. Motivated by a recent study on learning robustness without input perturbations by distilling an AT model, we explore what is learned during adversarial training by analyzing the distribution of logits in AT models. We identify three logit characteristics essential to learning adversarial robustness. First, we provide a theoretical justification for the finding that adversarial training shrinks two important characteristics of the logit distribution: the max logit values and the "logit gaps" (difference between the logit max and next largest values) are on average lower for AT models. Second, we show that AT and standard models differ significantly on which samples are high or low confidence, then illustrate clear qualitative differences by visualizing samples with the largest confidence difference. Finally, we find learning information about incorrect classes to be essential to learning robustness by manipulating the non-max logit information during distillation and measuring the impact on the student's robustness. Our results indicate that learning some adversarial robustness without input perturbations requires a model to learn specific sample-wise confidences and incorrect class orderings that follow complex distributions.
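The two logit statistics the abstract defines, the max logit and the "logit gap" (max minus second-largest), are straightforward to compute from a batch of logits. A minimal sketch (the function name is ours; comparing AT vs. standard models requires the trained networks themselves):

```python
import numpy as np

def logit_stats(logits: np.ndarray):
    """Per-sample max logit and logit gap (max minus second-largest logit)."""
    top2 = np.sort(logits, axis=1)[:, -2:]  # two largest logits per row
    max_logit = top2[:, 1]
    logit_gap = top2[:, 1] - top2[:, 0]
    return max_logit, logit_gap

logits = np.array([[4.0, 1.0, 0.5],    # confident: large max, large gap
                   [1.2, 1.0, 0.9]])   # uncertain: small max, small gap
max_logit, gap = logit_stats(logits)
assert np.allclose(max_logit, [4.0, 1.2])
assert np.allclose(gap, [3.0, 0.2])
```

Per the abstract, adversarially-trained models show lower values of both statistics on average than standard models.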
On the geometry of generalization and memorization in deep neural networks
Stephenson, Cory, Padhy, Suchismita, Ganesh, Abhinav, Hui, Yue, Tang, Hanlin, Chung, SueYeon
This part of the gradient behaves similarly for permuted and unpermuted examples. In Eq. 25 we see that the contribution to the label-dependent part of the gradient from permuted examples vanishes for large datasets, while the contribution from unpermuted examples does not, provided the cross-correlation between input features and labels is nonzero. This suggests that with small weight initialization, the gradient descent dynamics initially ignores the labels of permuted examples. Figure A.1 shows a breakdown of how the two components of the gradient computed on both unpermuted and permuted examples evolve over the course of training for the different layers of the VGG16 model trained on CIFAR-100. We see that the label-dependent part behaves qualitatively differently for the unpermuted examples than for the permuted examples, as the permuted examples give close to zero contribution early in training, in agreement with Eq. 25. The label-independent part of the gradient shows similar trends between unpermuted and permuted examples, though in the final epochs, the unpermuted examples have a slightly larger label-independent gradient, indicating slightly greater model confidence on these examples. As the label-dependent and label-independent parts of the gradient have differing signs, they compete with each other and cancel when the loss is minimized, but they are not independently zero and in fact grow during training. The slightly larger label-independent gradient for unpermuted examples is balanced by a correspondingly slightly larger label-dependent gradient at the end of training.
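Eq. 25 itself is not reproduced in this excerpt, but the decomposition it describes matches the standard split of the softmax cross-entropy logit gradient, p - y, into a label-independent part p and a label-dependent part -y; we assume that correspondence in the sketch below:

```python
import numpy as np

def grad_parts(logits: np.ndarray, one_hot: np.ndarray):
    """Split the softmax cross-entropy logit gradient (p - y) into its
    label-independent part p and label-dependent part -y."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p, -one_hot  # full gradient is their sum: p - y

z = np.array([[2.0, 0.5, -1.0]])
y = np.array([[1.0, 0.0, 0.0]])
indep, dep = grad_parts(z, y)
full = indep + dep
# On the true class the two parts have opposite signs and compete; they
# cancel (full -> 0) only as the model becomes confident, i.e. p -> y.
assert indep[0, 0] > 0 and dep[0, 0] < 0 and full[0, 0] < 0
```

This opposite-sign structure is exactly the competition-and-cancellation behavior described in the passage: the full gradient vanishes at the loss minimum even though neither part is individually zero.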